{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "7331ed20-4a7e-4ee3-8454-528ac48315f0",
"metadata": {
"id": "7331ed20-4a7e-4ee3-8454-528ac48315f0"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"source": [
"# Synthetic Data Generation Use Case with Vertex AI and Snowfakery\n",
"\n",
"This notebook provides a comprehensive solution for generating synthetic data using Vertex AI Gemini generative model and the Snowfakery data generation tool.\n",
"It leverages the power of these technologies to create realistic and customizable synthetic datasets based on existing data schemas and user inputs.\n",
"\n",
"## Key Features\n",
"\n",
"* **Schema Extraction:** Extracts table schemas from BigQuery to understand the structure and data types of your source data.\n",
"* **Sample Data Analysis:** Analyzes sample data from BigQuery tables to inform the generation of Snowfakery recipes.\n",
"* **Automated Recipe Generation:** Uses Gemini to automatically generate Snowfakery recipes based on table schemas and sample data.\n",
"* **Recipe Refinement:** Provides prompts and tools for refining and correcting generated recipes, including handling relationships between tables.\n",
"* **Data Generation with Snowfakery:** Utilizes Snowfakery to generate synthetic data based on the refined recipes and user-specified counts.\n",
"* **Data Storage and Loading:** Uploads generated data to Google Cloud Storage (GCS) and loads it into target BigQuery tables.\n",
"\n",
"## Workflow\n",
"\n",
"1. **Installation and Setup:**\n",
" - Install the required packages: Vertex AI SDK and Snowfakery.\n",
" - Restart the runtime to ensure the newly installed packages are available.\n",
" - Authenticate your notebook environment if using Google Colab.\n",
" - Set your Google Cloud project ID and location, and initialize the Vertex AI SDK.\n",
"\n",
"2. **Data Extraction and Analysis:**\n",
" - Specify the source BigQuery tables for data extraction.\n",
" - Define the number of sample rows to fetch from each table.\n",
" - Extract the schema of each source table.\n",
" - Read sample data from the source tables and upload it to GCS.\n",
"\n",
"3. **Recipe Generation and Refinement:**\n",
" - Use Gemini to generate initial Snowfakery recipes based on the extracted schemas and sample data.\n",
" - Refine the recipes using prompts and tools to handle data types, relationships, and constraints.\n",
" - Correct any syntax errors or inconsistencies in the recipes.\n",
"\n",
"4. **Data Generation and Loading:**\n",
" - Specify the desired number of records to generate for each table.\n",
" - Use Snowfakery to generate synthetic data based on the refined recipes and specified counts.\n",
" - Upload the generated data to GCS.\n",
" - Specify the target BigQuery tables for loading the synthetic data.\n",
" - Load the data from GCS into the target BigQuery tables.\n",
"\n",
"## Customization Options\n",
"\n",
"* **Source and Target Tables:** Modify the `source_table_list` and `target_table_list` parameters to specify your desired tables.\n",
"* **GCS Bucket:** Change the `gcs_bucket_name` parameter to use your own GCS bucket.\n",
"* **Sample Data Size:** Adjust the `total_rows_to_fetch_from_source_table` parameter to control the number of sample rows used for recipe generation.\n",
"* **Generated Data Counts:** Update the `user_counts` dictionary to specify the desired number of records for each table.\n",
"* **Fake Function List:** Modify the `fake_function_list` to include or exclude specific fake functions for data generation.\n",
"* **Prompts:** Customize the prompts used for recipe generation, correction, and relationship building to tailor the process to your needs.\n",
"\n",
"## Benefits\n",
"\n",
"* **Data Privacy:** Generate synthetic data that resembles real data without exposing sensitive information.\n",
"* **Data Augmentation:** Increase the size of your datasets for improved model training and testing.\n",
"* **Data Diversity:** Create synthetic data with specific characteristics or distributions to address imbalances or biases in your original data.\n",
"* **Data Sharing:** Share synthetic data with collaborators or external parties without compromising privacy.\n",
"\n",
"## Conclusion\n",
"\n",
"This notebook provides a powerful and flexible solution for generating synthetic data. By combining the capabilities of Vertex AI's Gemini and Snowfakery, you can create high-quality synthetic datasets that meet your specific needs and contribute to more robust and ethical data analysis and machine learning."
],
"metadata": {
"id": "WzqUmHirFGfG"
},
"id": "WzqUmHirFGfG"
},
{
"cell_type": "markdown",
"id": "e6aa9ed3-54c1-48e3-8c8f-e8f3cf152ec3",
"metadata": {
"id": "e6aa9ed3-54c1-48e3-8c8f-e8f3cf152ec3"
},
"source": [
"# Synthetic Data Generation Usecase"
]
},
{
"cell_type": "markdown",
"id": "e762db26-12ed-4287-b79f-37de312c6b59",
"metadata": {
"id": "e762db26-12ed-4287-b79f-37de312c6b59"
},
"source": [
"| | |\n",
"|-|-|\n",
"| Author | [Anas Aslam](https://github.com/aaslam17)"
]
},
{
"cell_type": "markdown",
"id": "56bad415-a400-42c3-800f-5c700a6c6cd7",
"metadata": {
"id": "56bad415-a400-42c3-800f-5c700a6c6cd7"
},
"source": [
"## Getting Started\n",
"\n",
"### Install Vertex AI SDK and other required packages"
]
},
{
"cell_type": "code",
"source": [
"%pip install --upgrade --user -q google-cloud-aiplatform snowfakery==3.6.3"
],
"metadata": {
"id": "6cM0HpybppRF",
"collapsed": true
},
"id": "6cM0HpybppRF",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "5c810968-005f-4776-8d2b-99e04a49b550",
"metadata": {
"id": "5c810968-005f-4776-8d2b-99e04a49b550"
},
"source": [
"### Restart runtime\n",
"\n",
"To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n",
"\n",
"The restart might take a minute or longer. After it's restarted, continue to the next step."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e0edafe-5cf9-4979-87fa-79958402f9dc",
"metadata": {
"id": "1e0edafe-5cf9-4979-87fa-79958402f9dc"
},
"outputs": [],
"source": [
"import IPython\n",
"\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"id": "b27adccc-967a-469c-a05e-5a1eabaef1a3",
"metadata": {
"id": "b27adccc-967a-469c-a05e-5a1eabaef1a3"
},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "299649d4-7959-455a-8971-d0d3bb5cc474",
"metadata": {
"id": "299649d4-7959-455a-8971-d0d3bb5cc474"
},
"source": [
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"If you are running this notebook on Google Colab, run the cell below to authenticate your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e641fb7d-2ed9-492e-ba8e-fabc86721630",
"metadata": {
"id": "e641fb7d-2ed9-492e-ba8e-fabc86721630"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user(project_id=\"project_id\")"
]
},
{
"cell_type": "markdown",
"id": "95c810db-eeea-4475-8c03-9016429272f8",
"metadata": {
"id": "95c810db-eeea-4475-8c03-9016429272f8"
},
"source": [
"### Set Google Cloud project information and initialize Vertex AI SDK\n",
"\n",
"To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
"\n",
"Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "529b3f2d-c270-4937-9b70-16651f260125",
"metadata": {
"id": "529b3f2d-c270-4937-9b70-16651f260125"
},
"outputs": [],
"source": [
"PROJECT_ID = \"project_id\" # @param {type:\"string\"}\n",
"LOCATION = \"region_id\" # @param {type:\"string\"}\n",
"\n",
"import vertexai\n",
"vertexai.init(project=PROJECT_ID, location=LOCATION)"
]
},
{
"cell_type": "markdown",
"id": "a0b04f91-62a4-48cf-8a84-50d41aaf5d79",
"metadata": {
"id": "a0b04f91-62a4-48cf-8a84-50d41aaf5d79"
},
"source": [
"### Import libraries"
]
},
{
"cell_type": "code",
"source": [
"from google.cloud import bigquery\n",
"from google.cloud import storage\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"import io\n",
"import google.api_core.exceptions\n",
"from vertexai.generative_models import GenerativeModel,GenerationConfig,Part\n",
"from snowfakery.plugins import SnowfakeryPlugin\n",
"from snowfakery import generate_data\n",
"from io import StringIO\n",
"from IPython.display import display, Markdown\n",
"import yaml\n",
"import json"
],
"metadata": {
"id": "-JvCJJZkDFyx"
},
"id": "-JvCJJZkDFyx",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Get the user inputs"
],
"metadata": {
"id": "fnKhhht8Mydt"
},
"id": "fnKhhht8Mydt"
},
{
"cell_type": "code",
"source": [
"# Semicolon-separated list of source BigQuery tables for retrieving sample data\n",
"source_table_list=\"project_id.dataset_id.table_id1;project_id.dataset_id.table_id2;project_id.dataset_id.table_id3\"\n",
"\n",
"# Semicolon-separated list of target BigQuery tables for loading generated data.\n",
"target_table_list=\"project_id.dataset_id.table_id1_dest;project_id.dataset_id.table_id2_dest;project_id.dataset_id.table_id3_dest\"\n",
"\n",
"# Number of rows to fetch from each source table for analysis and recipe generation.\n",
"total_rows_to_fetch_from_source_table=30\n",
"\n",
"# Name of the Google Cloud Storage bucket for storing intermediate and output data.\n",
"gcs_bucket_name=\"gcs_bucket_name\"\n",
"\n",
"# Dictionary specifying the desired number of records to generate for each table.\n",
"user_counts = {\n",
" \"table_id1\": 5,\n",
" \"table_id2\": 10,\n",
" \"table_id3\": 50\n",
"}"
],
"metadata": {
"id": "Mr1KbwWmF8qQ"
},
"id": "Mr1KbwWmF8qQ",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Read the source table , extract sample records and upload to GCS Bucket"
],
"metadata": {
"id": "CjbJmTGsOwEy"
},
"id": "CjbJmTGsOwEy"
},
{
"cell_type": "code",
"source": [
"def read_table_and_upload_to_gcs(project_id, dataset_id, table_id, bucket_name,blob_name,row_limit):\n",
" \"\"\"\n",
" Reads from a BigQuery table and uploads to GCS.\n",
"\n",
" Args:\n",
" project_id: Your Google Cloud project ID.\n",
" dataset_id: The BigQuery dataset ID.\n",
" table_id: The BigQuery table ID.\n",
" bucket_name: The name of your GCS bucket.\n",
" blob_name: The name of the blob (file) to create in the GCS bucket.\n",
" row_limit: The maximum number of rows to read from the table.\n",
" \"\"\"\n",
"\n",
" # Construct the table reference string\n",
" table_ref = f\"{project_id}.{dataset_id}.{table_id}\"\n",
"\n",
" # Establish a BigQuery client connection\n",
" bq_client = bigquery.Client(project=project_id)\n",
"\n",
" # Create a query job with the row limit\n",
" query_job = bq_client.query(f\"SELECT * FROM `{table_ref}` ORDER BY RAND() LIMIT {row_limit}\")\n",
"\n",
" # Get query results as a pandas DataFrame\n",
" df = query_job.to_dataframe()\n",
"\n",
" # Create a GCS client\n",
" storage_client = storage.Client()\n",
"\n",
" # Get a reference to the GCS bucket\n",
" bucket = storage_client.bucket(bucket_name)\n",
"\n",
" # Create a blob (file) in the bucket\n",
" blob = bucket.blob(blob_name)\n",
"\n",
" # Upload the DataFrame as a CSV file to GCS\n",
" blob.upload_from_string(df.to_csv(index=False), content_type=\"text/csv\")\n",
"\n",
" print(f\"Uploaded {len(df)} rows to gs://{bucket_name}/{blob_name}\")\n"
],
"metadata": {
"id": "bjYItOD0TYB7"
},
"id": "bjYItOD0TYB7",
"execution_count": null,
"outputs": []
},
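{
"cell_type": "markdown",
"id": "usage-read-table-example-md",
"metadata": {
"id": "usage-read-table-example-md"
},
"source": [
"The cell below is a minimal usage sketch of `read_table_and_upload_to_gcs`. The project, dataset, table, and blob names are placeholders (assumptions); the bucket and row limit reuse the user inputs defined earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-read-table-example-code",
"metadata": {
"id": "usage-read-table-example-code"
},
"outputs": [],
"source": [
"# Hypothetical example values; replace the project, dataset, table, and blob names with your own.\n",
"read_table_and_upload_to_gcs(\n",
"    project_id=\"project_id\",\n",
"    dataset_id=\"dataset_id\",\n",
"    table_id=\"table_id1\",\n",
"    bucket_name=gcs_bucket_name,\n",
"    blob_name=\"project_id/dataset_id/table_id1/input/sample_extract.csv\",\n",
"    row_limit=total_rows_to_fetch_from_source_table,\n",
")"
]
},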
{
"cell_type": "markdown",
"source": [
"# Read the Source Table and Extract the Table Schema"
],
"metadata": {
"id": "UahHrgHKSD2c"
},
"id": "UahHrgHKSD2c"
},
{
"cell_type": "code",
"source": [
"def extract_schema(project,dataset_id,table_id):\n",
" \"\"\"\n",
" Extracts the schema of a BigQuery table as a JSON string.\n",
"\n",
" Args:\n",
" project: The ID of the Google Cloud project.\n",
" dataset_id: The ID of the BigQuery dataset.\n",
" table_id: The ID of the BigQuery table.\n",
"\n",
" Returns:\n",
" A JSON string representing the table schema.\n",
" \"\"\"\n",
"\n",
" client = bigquery.Client() # Initialize BigQuery client\n",
" dataset_ref = client.dataset(dataset_id, project=project) # Get dataset reference\n",
" table_ref = dataset_ref.table(table_id) # Get table reference\n",
" table = client.get_table(table_ref) # Get table object\n",
"\n",
" f = io.StringIO(\"\") # Create a string buffer\n",
" client.schema_to_json(table.schema, f) # Write schema to buffer as JSON\n",
" return f.getvalue() # Return the JSON string"
],
"metadata": {
"id": "xo_5X2gUgx3O"
},
"id": "xo_5X2gUgx3O",
"execution_count": null,
"outputs": []
},
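{
"cell_type": "markdown",
"id": "usage-extract-schema-md",
"metadata": {
"id": "usage-extract-schema-md"
},
"source": [
"As a quick check of `extract_schema`, the sketch below fetches and pretty-prints the schema of one source table. The project, dataset, and table names are placeholders (assumptions)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-extract-schema-code",
"metadata": {
"id": "usage-extract-schema-code"
},
"outputs": [],
"source": [
"# Hypothetical example: pretty-print the schema of one source table as JSON.\n",
"schema_json = extract_schema(\"project_id\", \"dataset_id\", \"table_id1\")\n",
"print(json.dumps(json.loads(schema_json), indent=2))"
]
},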
{
"cell_type": "markdown",
"source": [
"#Generate Synthetic Data based on the Input Recipe"
],
"metadata": {
"id": "zuolYOQgUFkp"
},
"id": "zuolYOQgUFkp"
},
{
"cell_type": "code",
"source": [
"def generate_data_with_user_input(recipe):\n",
" \"\"\"\n",
" Generates synthetic data using Snowfakery based on user input.\n",
"\n",
" Args:\n",
" recipe: The Snowfakery recipe as a string.\n",
" output_format: The desired output format (e.g., csv, json).\n",
" output_folder: The path to the folder where output will be saved.\n",
" \"\"\"\n",
"\n",
" # Generate data using Snowfakery\n",
" generate_data(\n",
" StringIO(recipe),\n",
" output_format=\"csv\",\n",
" output_folder=\"/output\"\n",
" )\n",
"\n",
" print(\"Data generation complete. Check the output folder.\")\n",
"\n"
],
"metadata": {
"id": "E46VdnyWWdT6"
},
"id": "E46VdnyWWdT6",
"execution_count": null,
"outputs": []
},
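{
"cell_type": "markdown",
"id": "usage-snowfakery-sample-md",
"metadata": {
"id": "usage-snowfakery-sample-md"
},
"source": [
"The cell below is a small sanity check for `generate_data_with_user_input`: it runs Snowfakery on a tiny hand-written recipe (an assumption for illustration, not one produced by Gemini) and writes CSV output to the `/output` folder, creating the folder first if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-snowfakery-sample-code",
"metadata": {
"id": "usage-snowfakery-sample-code"
},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Ensure the folder that generate_data_with_user_input writes to exists.\n",
"os.makedirs(\"/output\", exist_ok=True)\n",
"\n",
"# Minimal hand-written recipe used only to verify that Snowfakery runs end to end.\n",
"sample_recipe = \"\"\"\n",
"- object: customers\n",
"  count: 3\n",
"  fields:\n",
"    customer_id: ${{unique_id}}\n",
"    name:\n",
"      fake: Name\n",
"    email:\n",
"      fake: Email\n",
"\"\"\"\n",
"generate_data_with_user_input(sample_recipe)"
]
},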
{
"cell_type": "markdown",
"source": [
"#Upload the generated data file from local to GCS bucket"
],
"metadata": {
"id": "Ttu7bgowUOvn"
},
"id": "Ttu7bgowUOvn"
},
{
"cell_type": "code",
"source": [
"def upload_from_local_to_gcs(bucket_name, blob_name, local_file_path):\n",
" \"\"\"\n",
" Uploads a CSV file fron local storage to Google Cloud Storage.\n",
"\n",
" Args:\n",
" bucket_name: The name of the GCS bucket.\n",
" blob_name: The name of the blob (file) in GCS.\n",
" file_path: The local path to the CSV file.\n",
" \"\"\"\n",
"\n",
" # Initialize GCS client\n",
" client = storage.Client()\n",
"\n",
" # Get a reference to the bucket\n",
" bucket = client.bucket(bucket_name)\n",
"\n",
" # Create a blob object\n",
" blob = bucket.blob(blob_name)\n",
"\n",
" # Read the CSV file into a Pandas DataFrame\n",
" df = pd.read_csv(local_file_path)\n",
"\n",
" # Upload the DataFrame as a CSV file to GCS\n",
" blob.upload_from_string(df.to_csv(index=False), content_type=\"text/csv\")\n",
"\n",
" print(f\"Uploaded local file to gs://{bucket_name}/{blob_name}\")\n",
" print(f\"The uploaded file contains \"+str(len(df))+ \" records\")"
],
"metadata": {
"id": "3I-klRomaozz"
},
"id": "3I-klRomaozz",
"execution_count": null,
"outputs": []
},
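{
"cell_type": "markdown",
"id": "usage-upload-local-md",
"metadata": {
"id": "usage-upload-local-md"
},
"source": [
"A minimal sketch of `upload_from_local_to_gcs`: it writes a tiny throwaway DataFrame to a local CSV and uploads it to the bucket configured above. The blob path is a made-up example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-upload-local-code",
"metadata": {
"id": "usage-upload-local-code"
},
"outputs": [],
"source": [
"# Hypothetical example: create a small local CSV and upload it to the configured bucket.\n",
"example_df = pd.DataFrame({\"name\": [\"Alice\", \"Bob\"], \"age\": [34, 29]})\n",
"example_local_path = \"/tmp/example_upload.csv\"\n",
"example_df.to_csv(example_local_path, index=False)\n",
"upload_from_local_to_gcs(gcs_bucket_name, \"examples/example_upload.csv\", example_local_path)"
]
},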
{
"cell_type": "markdown",
"source": [
"#Load the GCS File to Target Bigquery Table"
],
"metadata": {
"id": "4dvOIf0hUZ1J"
},
"id": "4dvOIf0hUZ1J"
},
{
"cell_type": "code",
"source": [
"def copy_table_schema(source_project_id,dest_project_id,source_dataset_id,dest_dataset_id, source_table_id,dest_table_id,):\n",
" \"\"\"\n",
" Creates a new Target BigQuery table with the same schema as an Source table.\n",
"\n",
" Args:\n",
" source_project_id: Your Source Google Cloud project ID.\n",
" dest_project_id: Your Target Google Cloud project ID.\n",
" source_dataset_id: The dataset ID of the source table.\n",
" source_table_id: The table ID of the source table.\n",
" dest_dataset_id: The dataset ID where the new table will be created.\n",
" dest_table_id: The desired table ID for the new table.\n",
" \"\"\"\n",
"\n",
" # Construct table reference strings\n",
" source_table_ref = f\"{source_project_id}.{source_dataset_id}.{source_table_id}\"\n",
" dest_table_ref = f\"{dest_project_id}.{dest_dataset_id}.{dest_table_id}\"\n",
"\n",
" # Establish a BigQuery client connection\n",
" bq_client = bigquery.Client()\n",
"\n",
" # Get the source table's information\n",
" source_table = bq_client.get_table(source_table_ref)\n",
"\n",
" # Create a new table with the same schema (fields)\n",
" table = bigquery.Table(dest_table_ref, schema=source_table.schema)\n",
" bq_client.create_table(table)\n",
"\n",
" print(f\"Created table {dest_table_ref} with the same schema as {source_table_ref}\")\n",
"\n",
"def gcs_to_bigquery(source_project_id,dest_project_id,source_dataset_id,dest_dataset_id, source_table_id,dest_table_id, bucket_name, file_name):\n",
" \"\"\"\n",
" Reads the receipe output from GCS bucket and loads it into a BigQuery table.\n",
" Create the destination table, if the table doesnot exist\n",
"\n",
" Args:\n",
" source_project_id: Your Source Google Cloud project ID.\n",
" dest_project_id: Your Target Google Cloud project ID.\n",
" source_dataset_id: The dataset ID of the source table.\n",
" source_table_id: The table ID of the source table.\n",
" dest_dataset_id: The dataset ID of the target table\n",
" dest_table_id: The table ID of the target table\n",
" bucket_name: The name of the GCS bucket.\n",
" blob_name: The name of the output file to load to bigquery\n",
" \"\"\"\n",
" # Construct the table reference string\n",
" dest_table_ref = f\"{dest_project_id}.{dest_dataset_id}.{dest_table_id}\"\n",
"\n",
" # Establish a BigQuery client connection\n",
" bq_client = bigquery.Client(project=source_project_id)\n",
"\n",
" try:\n",
" bq_client.get_table(dest_table_ref)\n",
" print(f\"Table {dest_table_id} exists in dataset {dest_dataset_id}\")\n",
" except google.api_core.exceptions.NotFound:\n",
" copy_table_schema(source_project_id,dest_project_id,source_dataset_id,dest_dataset_id, source_table_id,dest_table_id)\n",
"\n",
" #blob_name=f\"{dest_project_id}/{dest_dataset_id}/{dest_table_id}/output/\"+file_name;\n",
" blob_name=file_name;\n",
"\n",
" # Get a reference to the GCS bucket and blob\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(blob_name)\n",
"\n",
" # Configure the BigQuery job\n",
" job_config = bigquery.LoadJobConfig(\n",
" source_format=bigquery.SourceFormat.CSV, # Adjust if your file is not CSV\n",
" skip_leading_rows=1, # Skip header row (if present)\n",
" autodetect=False, # Autodetect schema\n",
" ignore_unknown_values = True #Ignore the id column generated by snowfakery\n",
" )\n",
"\n",
" # Create and execute the load job\n",
" load_job = bq_client.load_table_from_uri(\n",
" blob.public_url, dest_table_ref, job_config=job_config\n",
" )\n",
"\n",
" load_job.result() # Wait for the job to complete\n",
"\n",
" print(f\"Loaded data from gs://{bucket_name}/{blob_name} to {dest_table_ref}\")\n",
"\n"
],
"metadata": {
"id": "IzpLImSzpCHn"
},
"id": "IzpLImSzpCHn",
"execution_count": null,
"outputs": []
},
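{
"cell_type": "markdown",
"id": "usage-gcs-to-bq-md",
"metadata": {
"id": "usage-gcs-to-bq-md"
},
"source": [
"A usage sketch of `gcs_to_bigquery`, assuming a generated CSV already exists in the bucket at the given path. All project, dataset, table, and file names are placeholders."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-gcs-to-bq-code",
"metadata": {
"id": "usage-gcs-to-bq-code"
},
"outputs": [],
"source": [
"# Hypothetical example values; the file is assumed to already exist in the bucket.\n",
"gcs_to_bigquery(\n",
"    source_project_id=\"project_id\",\n",
"    dest_project_id=\"project_id\",\n",
"    source_dataset_id=\"dataset_id\",\n",
"    dest_dataset_id=\"dataset_id\",\n",
"    source_table_id=\"table_id1\",\n",
"    dest_table_id=\"table_id1_dest\",\n",
"    bucket_name=gcs_bucket_name,\n",
"    file_name=\"project_id/dataset_id/table_id1_dest/output/generated_example.csv\",\n",
")"
]
},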
{
"cell_type": "markdown",
"source": [
"#List of fake functions to be used"
],
"metadata": {
"id": "gsS0A3gyUndD"
},
"id": "gsS0A3gyUndD"
},
{
"cell_type": "code",
"source": [
"fake_function_list=\"\"\"\n",
"fake: Email\n",
"fake: Name\n",
"fake: Postalcode\n",
"fake: State\n",
"Address Fakers\n",
"fake: City\n",
"fake: Country\n",
"fake: CurrentCountry\n",
"fake: StreetAddress\n",
"Company Fakers\n",
"fake: CatchPhrase\n",
"fake: Company\n",
"Date_Time Fakers\n",
"fake: Time\n",
"fake: Year\n",
"Lorem Fakers\n",
"fake: Paragraph\n",
"fake: Sentence\n",
"fake: Text\n",
"fake: Word\n",
"Person Fakers\n",
"fake: FirstName\n",
"fake: FirstNameFemale\n",
"fake: FirstNameMale\n",
"fake: FirstNameNonbinary\n",
"fake: LastName\n",
"Phone_Number Fakers\n",
"fake: PhoneNumber\n",
"Rarely Used\n",
"Salesforce Fakers\n",
"fake: UserName\n",
"fake: Alias\n",
"fake: RealisticMaybeRealEmail\n",
"Address Fakers\n",
"fake: Address\n",
"fake: AdministrativeUnit\n",
"fake: BuildingNumber\n",
"fake: CityPrefix\n",
"fake: CitySuffix\n",
"fake: CountryCode\n",
"fake: CurrentCountryCode\n",
"fake: MilitaryApo\n",
"fake: MilitaryDpo\n",
"fake: MilitaryShip\n",
"fake: MilitaryState\n",
"fake: PostalcodeInState\n",
"fake: PostalcodePlus4\n",
"fake: Postcode\n",
"fake: PostcodeInState\n",
"fake: SecondaryAddress\n",
"fake: StateAbbr\n",
"fake: StreetName\n",
"fake: StreetSuffix\n",
"fake: Zipcode\n",
"fake: ZipcodeInState\n",
"fake: ZipcodePlus4\n",
"Automotive Fakers\n",
"fake: LicensePlate\n",
"Bank Fakers\n",
"fake: Aba\n",
"fake: BankCountry\n",
"fake: Bban\n",
"fake: Iban\n",
"fake: Swift\n",
"fake: Swift11\n",
"fake: Swift8\n",
"Barcode Fakers\n",
"fake: Ean\n",
"fake: Ean13\n",
"fake: Ean8\n",
"fake: LocalizedEan\n",
"fake: LocalizedEan13\n",
"fake: LocalizedEan8\n",
"fake: UpcA\n",
"fake: UpcE\n",
"Color Fakers\n",
"fake: Color\n",
"fake: ColorName\n",
"fake: HexColor\n",
"fake: RgbColor\n",
"fake: RgbCssColor\n",
"fake: SafeColorName\n",
"fake: SafeHexColor\n",
"Company Fakers\n",
"fake: Bs\n",
"fake: CompanySuffix\n",
"Credit_Card Fakers\n",
"fake: CreditCardExpire\n",
"fake: CreditCardFull\n",
"fake: CreditCardNumber\n",
"fake: CreditCardProvider\n",
"fake: CreditCardSecurityCode\n",
"Currency Fakers\n",
"fake: Cryptocurrency\n",
"fake: CryptocurrencyCode\n",
"fake: CryptocurrencyName\n",
"fake: Currency\n",
"fake: CurrencyCode\n",
"fake: CurrencyName\n",
"fake: CurrencySymbol\n",
"fake: Pricetag\n",
"Date_Time Fakers\n",
"fake: AmPm\n",
"fake: Century\n",
"fake.DateBetween:\n",
"fake: DateObject\n",
"fake: DateOfBirth\n",
"fake: DayOfMonth\n",
"fake: DayOfWeek\n",
"fake: FutureDate\n",
"fake: FutureDatetime\n",
"fake: Iso8601\n",
"fake: Month\n",
"fake: MonthName\n",
"fake: PastDate\n",
"fake: PastDatetime\n",
"fake: Pytimezone\n",
"fake: TimeDelta\n",
"fake: TimeObject\n",
"fake: TimeSeries\n",
"fake: Timezone\n",
"fake: UnixTime\n",
"Decorators.Py Fakers\n",
"fake: AsciiCompanyEmail\n",
"fake: AsciiEmail\n",
"fake: AsciiFreeEmail\n",
"fake: AsciiSafeEmail\n",
"fake: CompanyEmail\n",
"fake: DomainName\n",
"fake: DomainWord\n",
"fake: FreeEmail\n",
"fake: FreeEmailDomain\n",
"fake: Hostname\n",
"fake: SafeDomainName\n",
"fake: SafeEmail\n",
"fake: Slug\n",
"File Fakers\n",
"fake: FileExtension\n",
"fake: FileName\n",
"fake: FilePath\n",
"fake: MimeType\n",
"fake: UnixDevice\n",
"fake: UnixPartition\n",
"Geo Fakers\n",
"fake: Coordinate\n",
"fake: Latitude\n",
"fake: Latlng\n",
"fake: LocalLatlng\n",
"fake: LocationOnLand\n",
"fake: Longitude\n",
"Internet Fakers\n",
"fake: Dga\n",
"fake: HttpMethod\n",
"fake: IanaId\n",
"fake: ImageUrl\n",
"fake: Ipv4\n",
"fake: Ipv4NetworkClass\n",
"fake: Ipv4Private\n",
"fake: Ipv4Public\n",
"fake: Ipv6\n",
"fake: MacAddress\n",
"fake: PortNumber\n",
"fake: RipeId\n",
"fake: Tld\n",
"fake: Uri\n",
"fake: UriExtension\n",
"fake: UriPage\n",
"fake: UriPath\n",
"fake: Url\n",
"Isbn Fakers\n",
"fake: Isbn10\n",
"fake: Isbn13\n",
"Job Fakers\n",
"fake: Job\n",
"Lorem Fakers\n",
"fake: Paragraphs\n",
"fake: Sentences\n",
"fake: Texts\n",
"fake: Words\n",
"Misc Fakers\n",
"fake: Binary\n",
"fake: Boolean\n",
"fake: Csv\n",
"fake: Dsv\n",
"fake: FixedWidth\n",
"fake: Image\n",
"fake: Json\n",
"fake: Md5\n",
"fake: NullBoolean\n",
"fake: Password\n",
"fake: Psv\n",
"fake: Sha1\n",
"fake: Sha256\n",
"fake: Tar\n",
"fake: Tsv\n",
"fake: Uuid4\n",
"fake: Zip\n",
"Person Fakers\n",
"fake: LanguageName\n",
"fake: LastNameFemale\n",
"fake: LastNameMale\n",
"fake: LastNameNonbinary\n",
"fake: Name\n",
"fake: NameFemale\n",
"fake: NameMale\n",
"fake: NameNonbinary\n",
"fake: Prefix\n",
"fake: PrefixFemale\n",
"fake: PrefixMale\n",
"fake: PrefixNonbinary\n",
"fake: Suffix\n",
"fake: SuffixFemale\n",
"fake: SuffixMale\n",
"fake: SuffixNonbinary\n",
"Phone_Number Fakers\n",
"fake: CountryCallingCode\n",
"fake: Msisdn\n",
"Profile Fakers\n",
"fake: Profile\n",
"fake: SimpleProfile\n",
"Providers Fakers\n",
"fake: Bothify\n",
"fake: Hexify\n",
"fake: LanguageCode\n",
"fake: Lexify\n",
"fake: Locale\n",
"fake: Numerify\n",
"Python Fakers\n",
"fake: Pybool\n",
"fake: Pydecimal\n",
"fake: Pydict\n",
"fake: Pyfloat\n",
"fake: Pyint\n",
"fake: Pyiterable\n",
"fake: Pylist\n",
"fake: Pyset\n",
"fake: Pystr\n",
"fake: PystrFormat\n",
"fake: Pystruct\n",
"fake: Pytuple\n",
"Ssn Fakers\n",
"fake: Ein\n",
"fake: InvalidSsn\n",
"fake: Itin\n",
"fake: Ssn\n",
"User_Agent Fakers\n",
"fake: AndroidPlatformToken\n",
"fake: Chrome\n",
"fake: Firefox\n",
"fake: InternetExplorer\n",
"fake: IosPlatformToken\n",
"fake: LinuxPlatformToken\n",
"fake: LinuxProcessor\n",
"fake: MacPlatformToken\n",
"fake: MacProcessor\n",
"fake: Opera\n",
"fake: Safari\n",
"fake: UserAgent\n",
"fake: WindowsPlatformToken\n",
"\"\"\""
],
"metadata": {
"id": "Pvf-yrzYRJqS"
},
"id": "Pvf-yrzYRJqS",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Prompt for Recipe Creation"
],
"metadata": {
"id": "DyS4voeOCsP_"
},
"id": "DyS4voeOCsP_"
},
{
"cell_type": "code",
"source": [
"def create_receipe_prompt(fake_function_list,table_schema):\n",
" Recipe_Prompt= f\"\"\"\n",
"Purpose and Goals:\n",
"\n",
"* Assist users in creating Snowfakery recipes based on their table schemas.\n",
"* Generate recipes that accurately reflect the structure and data types of the input schemas.\n",
"* Prioritize clarity and readability in the generated recipes.\n",
"\n",
"Behaviors and Rules:\n",
"\n",
"1. Schema Extraction:\n",
" - Carefully parse the JSON schema provided by the user.\n",
" - Extract relevant information for each column, including 'name', 'type', and 'description'.\n",
"\n",
"2. Recipe Generation:\n",
" - Analyze the extracted column information.\n",
" - Consult the Snowfakery documentation ([https://snowfakery.readthedocs.io/en/docs/fakedata.html](https://snowfakery.readthedocs.io/en/docs/fakedata.html)) to understand about each fake function usage\n",
" - Use the below fake function list to determine appropriate fake functions for each column based on its type and description. {fake_function_list}\n",
" - Construct a Snowfakery recipe in YAML format, using the identified fake functions and incorporating any relationships between tables.\n",
" - For the fields similar to indentifier field, use ${{unique_id}}\n",
"\n",
"3. Receipe Validation:\n",
" - Carefully read the user provided sample records and for each of the string type column, provide random_choice as arguments for each of the recipe fields from the sample records.\n",
" - For integer, float, deciaml, bigdecimal fields, use the random_number with min and max values\n",
" - Use random_choice and random_number, only if no fake function can be found for those fields. If a fake function is found for a particular field, stick to that fake function only, instead of random_choice and random_number\n",
" - If there is any column dependencies based on the semantic meaning of the columns. Example Start_Date depends on End_Date, Start_time depends on End_Time.\n",
" - Apply the below technique to create column dependencies. Make sure the indentations and colons are kept in the same way\n",
" - Strictly validate the number of curly braces used. Example it should $ sign followed by have two ${{unique_id}}.\n",
" - Strictly keep the provided table name as the yaml object name\n",
"\n",
"Example: Column dependencies between travel_start_date and travel_end_date, trip_start_datetime and trip_end_datetime ,random_choice for name field, id field and random_number is as below\n",
"- object: trips\n",
" count: 5\n",
" fields:\n",
" trip_id:${{unique_id}}\n",
" trip_type:\n",
" random_choice:\n",
" - One Way\n",
" - Round Trip\n",
" name:\n",
" fake: Name\n",
" email:\n",
" fake: Email\n",
" age:\n",
" random_number:\n",
" min: 12\n",
" max: 95\n",
" travel_start_date:\n",
" fake.DateBetween:\n",
" start_date: -365d\n",
" end_date: -1d\n",
" travel_end_date:\n",
" fake.DateBetween:\n",
" start_date: +0d\n",
" end_date: +365d\n",
" - For Date and Datetime Column Types, strictly use \"fake.DateBetween:\" as mentioned in the example above.\n",
" - Since fake function is available for the name and email field, respective fake function is used.\n",
" - Since fake fuction is not avaialble for the trip_type and age, random_choice and random_number is used.\n",
"\n",
"-For any column which is semantic closer to the Identifier. Example Trip_Id, Plan_Id etc, always use the unique_id Faker function.\n",
"\n",
"- Make sure you follow the same output pattern as above without missing any colon, indentation, hypen, line breaks etc\n",
"- Strictly include only the below list of fake functions\n",
"\n",
" {fake_function_list}\n",
"\n",
" 4. Recipe Presentation:\n",
"\n",
" - Present the generated Snowfakery recipe to the user in well formatted YAML format only.\n",
" - Include comments in the recipe to explain any complex or non-obvious choices.\n",
" - Remind the user that the generated recipe is a starting point and encourage them to customize and refine it as needed.\n",
"\n",
" 5. Validation:\n",
" - Validate that the recipe doesnot inclue any fake funciton which is not part of the list shared earlier.\n",
" - Validate that the random_choice is correctly added for all the string type fields\n",
" - Strictly validate that the output is a valid YAML with no special characters or formatting issues.\n",
" - Strictly validate the number of curly braces used. Example it should $ sign followed by have two curly braces like ${{unique_id}}.\n",
" - Strictly Keep the same indentation and line breaks as provided in the example recipe.\n",
" - Strictly keep object name same as the table_name in the provided input schema\n",
"\n",
"\n",
"Additional Notes:\n",
"* Strictly adhere to the fake functions defined in the Snowfakery documentation and the provided list. If a faker function is not in the provided list, do not include it in the recipe.\n",
"* Strictly format the output in the valid YAML format with the correct formatting\n",
"* Strictly Keep the same indentation and line breaks as provided in the example recipe.\n",
"* Remove ```yaml and ``` from the output\n",
"* The output should contain only the receip and no other explanation\n",
"\n",
"\n",
"User Schmea Input:\n",
"\n",
"{table_schema}\n",
"\n",
"\"\"\"\n",
" return Recipe_Prompt;\n"
],
"metadata": {
"id": "YKMvz2EGVrVE"
},
"id": "YKMvz2EGVrVE",
"execution_count": null,
"outputs": []
},
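{
"cell_type": "markdown",
"id": "usage-prompt-preview-md",
"metadata": {
"id": "usage-prompt-preview-md"
},
"source": [
"To preview what Gemini receives, the sketch below builds a recipe-creation prompt from a small hypothetical schema (the `customers` table and its columns are made up) and prints the beginning of the prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-prompt-preview-code",
"metadata": {
"id": "usage-prompt-preview-code"
},
"outputs": [],
"source": [
"# Hypothetical two-column schema used only to preview the generated prompt text.\n",
"example_schema = json.dumps([\n",
"    {\"name\": \"customer_id\", \"type\": \"INTEGER\", \"description\": \"Unique customer identifier\"},\n",
"    {\"name\": \"email\", \"type\": \"STRING\", \"description\": \"Customer email address\"}\n",
"])\n",
"preview_prompt = create_recipe_prompt(fake_function_list, \"Table Name: customers\" + example_schema)\n",
"print(preview_prompt[:1500])"
]
},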
{
"cell_type": "markdown",
"source": [
"#Prompt for Recipe Correction"
],
"metadata": {
"id": "hZ2vwP47CwbF"
},
"id": "hZ2vwP47CwbF"
},
{
"cell_type": "code",
"source": [
"yaml_correction_prompt=\"\"\"You are a YAML text processing expert. Your task is to carefully analyze a given YAML text and modify it according to the following rule:\n",
"\n",
"Rule 1: If a field value within the YAML text contains only a single pair of curly braces (e.g., `${field}`), replace it with two pairs of curly braces (e.g., `${{field}}`).\n",
"Rule 2: Convert the decimal values in the min and max arguments of the random_number function to the corresponding integer values\n",
"\n",
"**Input:**\n",
"\n",
"The user will provide you with a string of YAML text.\n",
"\n",
"**Output:**\n",
"\n",
"**Updated YAML:**\n",
"1. Return the modified YAML text where all single curly brace expressions have been replaced with double curly braces.\n",
"2. Return the modified YAML with the decmail fields replaced with the corresponding integer values\n",
"\n",
"**Important Considerations:**\n",
"\n",
"* **YAML Syntax:** Ensure that your modifications maintain the correct YAML syntax.\n",
"* **Preservation:** Preserve the structure and indentation of the original YAML.\n",
"* **Clarity:** Make your output clear and easy to understand.\n",
"\n",
"**Example:**\n",
"\n",
"If the user provides the following YAML:\n",
"\n",
"- object: employees\n",
" count: 5\n",
" fields:\n",
" sales_id: ${unique_id}\n",
" sales_name:\n",
" random_choice:\n",
" - Henry Jones\n",
" - Ivy Taylor\n",
" - Kelly White\n",
" tax:\n",
" random_number:\n",
" min: 21.87\n",
" max: 48.98\n",
" win_id: ${unique_id}\n",
"\n",
"Your Output should be\n",
"- object: employees\n",
" count: 5\n",
" fields:\n",
" sales_id: ${{unique_id}} --Added extra curly braces\n",
" sales_name:\n",
" random_choice:\n",
" - Henry Jones\n",
" - Ivy Taylor\n",
" - Kelly White\n",
" tax:\n",
" random_number:\n",
" min: 21 -- Converted from Decimal to Integer\n",
" max: 48 -- Converted from Decimal to Integer\n",
" win_id: ${{unique_id}} --Added extra curly braces\n",
"\n",
"* Strictly Keep the same indentation and line breaks as provided in the example recipe.\n",
"* Strictly keep the same object name as in the input\n",
"* Remove ```yaml and ``` from the output\n",
"* The output should contain only the receip and no other explanation\n",
"\"\"\"\n"
],
"metadata": {
"id": "Lvx33NNl2cOQ"
},
"id": "Lvx33NNl2cOQ",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Prompt for building the relationship between the tables"
],
"metadata": {
"id": "JKjWjN2eC1CL"
},
"id": "JKjWjN2eC1CL"
},
{
"cell_type": "code",
"source": [
"Relationship_builder=\"\"\"Purpose: Given the schema of multiple database tables, determine their relationships and the type of each table.\n",
"\n",
"\n",
"\n",
"Input Format:\n",
"\n",
"* A string containing the table definitions. Each definition includes:\n",
"\n",
" * Table name.\n",
"\n",
" * Column names and their data types.\n",
"\n",
"\n",
"\n",
"Output Format:\n",
"\n",
"* A JSON array where each element represents a table and contains:\n",
"\n",
" * table_name (string): The name of the table.\n",
"\n",
" * table_type (string): Either 'dimension' or 'fact'.\n",
"\n",
" * foreign_keys (array):\n",
"\n",
" * column_name (string): The name of the foreign key column in this table.\n",
"\n",
" * references_table (string): The name of the referenced table.\n",
"\n",
" * references_column (string): The name of the referenced column.\n",
"\n",
" * relationship_type (string): Either 'One-to-Many (1:M)' or 'Many-to-One (M:1)'. If the table is a dimension table, this array is empty.\n",
"\n",
"\n",
"\n",
"Rules and Instructions:\n",
"\n",
"1. **Schema Parsing:**\n",
"\n",
" * Identify each table, its name, columns, and data types.\n",
"\n",
"\n",
"\n",
"2. **Table Type Determination:**\n",
"\n",
" * Classify each table as either:\n",
"\n",
" * **Dimension:** Primarily stores descriptive attributes, often used to filter or group data in fact tables.\n",
"\n",
" * **Fact:** Primarily stores measurable or quantitative data, often related to events or transactions.\n",
"\n",
" * If unsure, make an educated guess based on common patterns\n",
"\n",
"\n",
"\n",
"3. **Relationship Identification:**\n",
"\n",
" * Look for foreign key relationships between tables.\n",
"\n",
" * Foreign keys are columns that reference the primary key of another table.\n",
"\n",
" * Determine the type of relationship:\n",
"\n",
" * **One-to-Many (1:M):** One record in the dimension table can be associated with many records in the fact table.\n",
"\n",
" * **Many-to-One (M:1):** Many records in the fact table can be associated with one record in the dimension table.\n",
"\n",
"\n",
"\n",
"4. **Output Generation:**\n",
"\n",
" * For each table, create a JSON object following the specified format.\n",
"\n",
" * Include relevant foreign key information in the `foreign_keys` array.\n",
"\n",
" * Ensure the output is valid JSON.\n",
"\n",
"\n",
"Additional Notes:\n",
"\n",
"* This is a simplified approach. Real-world schema analysis can be more complex.\n",
"\n",
"* For large schemas, you might implement more advanced algorithms for relationship discovery.\n",
"\"\"\""
],
"metadata": {
"id": "Ueyg3gSyBHnC"
},
"id": "Ueyg3gSyBHnC",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Prompt for updating the Recipe based on the table relationship"
],
"metadata": {
"id": "gPh_w4EWC8mD"
},
"id": "gPh_w4EWC8mD"
},
{
"cell_type": "code",
"source": [
"Relationship_Prompt=\"\"\"Below is the Example of Snowfakery yml file for tables emp and dept with minimal number of columns to demonstrate how primary key and foreign key can be represented while creating recipe.\n",
"- object: dept\n",
" count: 4\n",
" fields:\n",
" deptno: ${{unique_id}}\n",
" name:\n",
" random_choice:\n",
" - Henry Jones\n",
" - Ivy Taylor\n",
" - Kelly White\n",
"- object: emp\n",
" count: 10\n",
" fields:\n",
" __dept_fk:\n",
" random_reference:\n",
" to: dept\n",
" unique: False\n",
" empno: ${{unique_id}}\n",
" deptno: ${{__dept_fk.deptno}}\n",
"\n",
"\n",
"## Snowfakery Recipe Breakdown: Department and Employee Data Generation\n",
"\n",
"This Snowfakery recipe generates synthetic data for two related objects: `dept` (department) and `emp` (employee). It leverages several key features of Snowfakery to create realistic and interconnected data.\n",
"\n",
"**Object: dept**\n",
"\n",
"- **`object: dept`**: This line declares the first object to be generated, representing a department.\n",
"- **`count: 4`**: This specifies that four department records will be created.\n",
"- **`fields:`**: This introduces the fields that will be populated for each department record.\n",
" - **`deptno: ${{unique_id}}`**: This assigns a unique ID to each department using the `unique_id` function, ensuring each department has a distinct identifier.\n",
" - **`name: fake: Name`**: This generates a fake department name using the `fake` function with the `Name` provider. This will produce random, yet plausible, department names.\n",
"\n",
"**Object: emp**\n",
"\n",
"- **`object: emp`**: This line declares the second object to be generated, representing an employee.\n",
"- **`count: 10`**: This specifies that ten employee records will be created.\n",
"- **`fields:`**: This introduces the fields that will be populated for each employee record.\n",
" - **`__dept_fk: random_reference: to: dept unique: False`**: This is a hidden field used to establish a relationship between employees and departments. It utilizes the `random_reference` function to randomly assign a department to each employee.\n",
" - **`to: dept`**: This specifies that the reference should point to a `dept` object.\n",
" - **`unique: False`**: This allows multiple employees to be assigned to the same department, reflecting a realistic scenario.\n",
" - **`empno: ${{unique_id}}`**: This assigns a unique ID to each employee using the `unique_id` function, ensuring each employee has a distinct identifier.\n",
" - **`deptno: ${{__dept_fk.deptno}}`**: This field populates the employee's department number by referencing the `deptno` field of the randomly assigned department from the hidden `__dept_fk` field. This creates a direct link between the employee and their department.\n",
"\n",
"**Recipe Execution**\n",
"\n",
"When this recipe is executed, Snowfakery will first generate four unique department records with random names. Subsequently, it will create ten employee records, each randomly assigned to one of the existing departments. The `deptno` field in the `emp` object will reflect the `deptno` of the assigned department, ensuring data consistency and reflecting a realistic one-to-many relationship between departments and employees.\n",
"\n",
"**Key Features Utilized**\n",
"\n",
"- **`unique_id` function**: Generates unique identifiers for both departments and employees.\n",
"- **`fake` function**: Generates fake department names using the `Name` provider.\n",
"- **`random_reference` function**: Randomly assigns departments to employees.\n",
"- **Hidden fields**: Used to store intermediate values and establish relationships without appearing in the final output.\n",
"\n",
"This recipe demonstrates the power and flexibility of Snowfakery in generating synthetic data with realistic relationships between objects. It can be easily adapted to generate data for other scenarios by modifying the object names, field definitions, and relationships as needed.\n",
"\n",
"\n",
"Task:\n",
"* Correct the below provided Snowfakery recipe (yml) files to include correct primary key generation and foreign key references.\n",
"* Use the provided relationship\n",
"* Remove ```yaml and ``` from the output\n",
"* The output should contain only the receip and no other explanation\n",
"* Strictly Keep the same indentation and line breaks as provided in the example recipe.\n",
"* Strictly keep the provided table name as the yaml object name\n",
"\n",
"\n",
"Here's a step-by-step guide for completing your tasks:\n",
"1. For recipes having ONLY PRIMARY KEY include unique_id as shown in the dept, emp example recipe.\n",
"2. For recipes having FOREIGN KEY reference include\n",
"a. Hidden field referring to Primary table recipe\n",
"PS: unique=False for 'One-to-Many (1:M)' relationship and unique=True for 'One-to-One (1:1)' relationship\n",
"b. Hidden fields should be placed right after `fields` keyword of that particular object\n",
"c. Replace the foreign key column value to refer to primary column recipe\n",
"3. When there are conflicting scenarios between\n",
"4. All other columns will remain as it is. No changes are required for other columns.\n",
"5. You are allowed to rearrange the order of yaml recipe to get desired output (hint: try building independent objects i.e., without foreign key first and only after that bring in dependent objects)\n",
"6. Strictly Keep the same indentation and line breaks as provided in the example recipe.\n",
"7. Strictly keep the same object name as in the input\n",
"8. Always add - before the \"object\" keyword in the yaml recipe\n",
"9. Just provide output yaml; no explaination is needed\n",
"\"\"\""
],
"metadata": {
"id": "GztCgo47lWw8"
},
"id": "GztCgo47lWw8",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#Prompt for correcting the object name in the Recipe"
],
"metadata": {
"id": "Uj6ylzNKDC6V"
},
"id": "Uj6ylzNKDC6V"
},
{
"cell_type": "code",
"source": [
"Correct_Object_Name_Prompt=\"\"\" Your task is to read the relationship input between tables, extract the table name form the relationship.\n",
"Then, read the provided yaml receipe and replace the object name with the name of the table names extracted from the relationship.\n",
"* Just provide output yaml; no explaination is needed\n",
"* Remove ```yaml and ``` from the output\n",
"* Always add - before the \"object\" keyword in the yaml recipe\n",
"* Strictly Keep the same indentation and line breaks as provided in the input recipe.\n",
"\"\"\""
],
"metadata": {
"id": "-zRvkiNcrIsL"
},
"id": "-zRvkiNcrIsL",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "25bf5867",
"metadata": {
"id": "25bf5867"
},
"source": [
"### Initialize model\n",
"\n",
"Initialize the Gemini model with the desired model parameters and `Tool` that we defined earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9938f4ec-edd7-4738-85bd-ed125bdd54c4",
"metadata": {
"tags": [],
"id": "9938f4ec-edd7-4738-85bd-ed125bdd54c4"
},
"outputs": [],
"source": [
"gemini_model = GenerativeModel(\n",
" \"gemini-1.5-pro-001\",\n",
" generation_config=GenerationConfig(temperature=0)\n",
")"
]
},
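{
"cell_type": "markdown",
"id": "usage-model-smoke-test-md",
"metadata": {
"id": "usage-model-smoke-test-md"
},
"source": [
"Optional sanity check: the cell below sends a trivial prompt to the initialized model to confirm that the project, location, and model name are usable before running the longer pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-model-smoke-test-code",
"metadata": {
"id": "usage-model-smoke-test-code"
},
"outputs": [],
"source": [
"# Quick smoke test of the model; skip once access is confirmed.\n",
"response = gemini_model.generate_content(\"Reply with the single word: ready\")\n",
"print(response.text)"
]
},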
{
"cell_type": "code",
"source": [
"def generate_receipes(source_table_list,gcs_bucket_name,total_rows_to_fetch_from_source_table):\n",
" \"\"\"\n",
" Generates Snowfakery recipes for a list of BigQuery source tables.\n",
"\n",
" Args:\n",
" source_table_list: A semicolon-separated list of source BigQuery tables.\n",
" gcs_bucket_name: The name of the GCS bucket for storing data.\n",
" total_rows_to_fetch_from_source_table: The number of rows to fetch from each source table.\n",
"\n",
" Returns:\n",
" A tuple containing the combined YAML recipe and the combined schema.\n",
" \"\"\"\n",
" source_tables = source_table_list.split(';')\n",
" combined_yaml=''\n",
" combined_schema=''\n",
" for src_table in source_tables:\n",
" source_table = src_table.split('.')\n",
" source_project_id = source_table[0]\n",
" source_dataset_id = source_table[1]\n",
" source_table_id = source_table[2]\n",
" table_schema=extract_schema(source_project_id,source_dataset_id,source_table_id)\n",
" combined_schema=combined_schema+\"Table Name:\"+source_table_id+table_schema\n",
" Recipe_Prompt=create_receipe_prompt(fake_function_list,table_schema)\n",
" current_timestamp = datetime.now().strftime(\"%Y%m%d%H%M%S%f\")\n",
" source_file_name = f\"{source_project_id}/{source_dataset_id}/{source_table_id}/input/extract_{current_timestamp}.csv\"\n",
" read_table_and_upload_to_gcs(source_project_id, source_dataset_id, source_table_id, gcs_bucket_name, source_file_name, total_rows_to_fetch_from_source_table)\n",
" source_file_path = f\"gs://{gcs_bucket_name}/{source_file_name}\"\n",
" bigquery_records = Part.from_uri(source_file_path, mime_type=\"text/csv\")\n",
" yaml_output = gemini_model.generate_content([Recipe_Prompt, \"Sample Records:\", bigquery_records]).text\n",
" chat_yaml_corection = gemini_model.start_chat()\n",
" final_yaml_output=chat_yaml_corection.send_message(yaml_correction_prompt + \"\"\" **Input:** \"\"\" + yaml_output).text\n",
" combined_yaml=combined_yaml+final_yaml_output\n",
" return combined_yaml,combined_schema"
],
"metadata": {
"id": "ZL_Bc4bpWGqC"
},
"id": "ZL_Bc4bpWGqC",
"execution_count": null,
"outputs": []
},
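{
"cell_type": "markdown",
"id": "usage-generate-recipes-preview-md",
"metadata": {
"id": "usage-generate-recipes-preview-md"
},
"source": [
"If you want to inspect the intermediate recipes before anything is generated or loaded, you can call `generate_recipes` on its own, as sketched below. Note that this still reads the configured source tables, writes sample extracts to the bucket, and calls Gemini."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-generate-recipes-preview-code",
"metadata": {
"id": "usage-generate-recipes-preview-code"
},
"outputs": [],
"source": [
"# Optional: preview the combined recipe and schema without generating or loading any data.\n",
"combined_yaml_preview, combined_schema_preview = generate_recipes(\n",
"    source_table_list, gcs_bucket_name, total_rows_to_fetch_from_source_table\n",
")\n",
"print(combined_yaml_preview)"
]
},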
{
"cell_type": "code",
"source": [
"def generate_output_data(source_table_list,gcs_bucket_name,total_rows_to_fetch_from_source_table):\n",
" \"\"\"\n",
" Generates synthetic data and loads it into BigQuery tables.\n",
"\n",
" This function orchestrates the process of generating synthetic data\n",
" based on source BigQuery tables and loading it into target BigQuery tables.\n",
" It involves generating Snowfakery recipes, extracting table relationships,\n",
" creating and uploading data to GCS, and finally loading the data into BigQuery.\n",
" Args:\n",
" source_table_list: A semicolon-separated list of source BigQuery tables.\n",
" gcs_bucket_name: The name of the GCS bucket for storing data.\n",
" total_rows_to_fetch_from_source_table: The number of rows to fetch from each source table.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" combined_yaml,combined_schema=generate_receipes(source_table_list,gcs_bucket_name,total_rows_to_fetch_from_source_table)\n",
" relationship_output= gemini_model.generate_content([Relationship_builder,combined_schema])\n",
" print(relationship_output.text)\n",
" final_yaml_output = gemini_model.generate_content([Relationship_Prompt,relationship_output.text,combined_yaml])\n",
" corrected_yaml_output = gemini_model.generate_content([Correct_Object_Name_Prompt,relationship_output.text,final_yaml_output.text])\n",
" recipe = corrected_yaml_output.text\n",
" # Load the YAML recipe\n",
" yaml_data = yaml.safe_load(recipe)\n",
"\n",
" # Update the counts for each object\n",
" for item in yaml_data:\n",
" object_name = item.get('object')\n",
" if object_name in user_counts:\n",
" item['count'] = user_counts[object_name]\n",
"\n",
" # Dump the updated YAML recipe\n",
" updated_recipe = yaml.dump(yaml_data, sort_keys=False)\n",
" print(updated_recipe)\n",
" generate_data_with_user_input(updated_recipe)\n",
"\n",
" source_tables = source_table_list.split(';')\n",
" target_tables = target_table_list.split(';')\n",
"\n",
" # Create a dictionary to map source tables to target tables\n",
" table_map = {}\n",
"\n",
" # Iterate through the lists and create the mapping\n",
" for i in range(len(source_tables)):\n",
" source_table_name = source_tables[i] # Extract table name from fully qualified name\n",
" target_table_name = target_tables[i]\n",
" table_map[source_table_name] = target_table_name\n",
"\n",
" # Load the JSON data\n",
" data = json.loads(json.dumps(table_map))\n",
"\n",
" # Iterate through the key-value pairs\n",
" for key, value in data.items():\n",
" source_table_name=key.split('.')[-1]\n",
" source_dataset_name=key.split('.')[-2]\n",
" source_project_id=key.split('.')[-3]\n",
" target_table_name=value.split('.')[-1]\n",
" target_dataset_name=value.split('.')[-2]\n",
" target_project_id=value.split('.')[-3]\n",
" now = datetime.now()\n",
" current_timestamp = now.strftime(\"%Y%m%d%H%M%S%f\")\n",
" target_file_name=f\"{target_project_id}/{target_dataset_name}/{target_table_name}/output/generated_{current_timestamp}.csv\"\n",
" local_file_path=f\"/output/{source_table_name}.csv\"\n",
" upload_from_local_to_gcs(gcs_bucket_name, target_file_name, local_file_path)\n",
" gcs_to_bigquery(source_project_id,target_project_id,source_dataset_name,target_dataset_name, source_table_name,target_table_name, gcs_bucket_name, target_file_name)\n",
" return None\n"
],
"metadata": {
"id": "XFhgpC4tx9IT"
},
"id": "XFhgpC4tx9IT",
"execution_count": null,
"outputs": []
},
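{
"cell_type": "markdown",
"id": "usage-count-override-md",
"metadata": {
"id": "usage-count-override-md"
},
"source": [
"The count-override step inside `generate_output_data` can be tried in isolation. The sketch below applies a hypothetical `user_counts`-style mapping to a toy recipe, mirroring the logic used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "usage-count-override-code",
"metadata": {
"id": "usage-count-override-code"
},
"outputs": [],
"source": [
"# Toy demonstration of the count-override logic, independent of BigQuery and Gemini.\n",
"toy_recipe = \"\"\"\n",
"- object: table_id1\n",
"  count: 1\n",
"  fields:\n",
"    name:\n",
"      fake: Name\n",
"\"\"\"\n",
"toy_counts = {\"table_id1\": 7}\n",
"toy_yaml = yaml.safe_load(toy_recipe)\n",
"for item in toy_yaml:\n",
"    if item.get(\"object\") in toy_counts:\n",
"        item[\"count\"] = toy_counts[item.get(\"object\")]\n",
"print(yaml.dump(toy_yaml, sort_keys=False))"
]
},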
{
"cell_type": "code",
"source": [
"#Generate Synthetic Data by calling the below function\n",
"generate_output_data(source_table_list,gcs_bucket_name,total_rows_to_fetch_from_source_table)"
],
"metadata": {
"id": "HkK7ckkRkiwS"
},
"id": "HkK7ckkRkiwS",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"environment": {
"kernel": "conda-root-py",
"name": "workbench-notebooks.m115",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/workbench-notebooks:m115"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}